feat: split out @databricks/lakebase-auth with eager refresh and retry#461
feat: split out @databricks/lakebase-auth with eager refresh and retry#461jawj wants to merge 9 commits into
Conversation
Extract OAuth credential/token-refresh logic from @databricks/lakebase. Move these to a new, lightweight @databricks/lakebase-auth package. Add eager (default) and lazy refresh, plus retry on transient failures. Signed-off-by: George MacKerron <georgemackerron@neon.tech>
|
…, update README usage examples
…echeck and packaging Signed-off-by: George MacKerron <georgemackerron@neon.tech>
Signed-off-by: George MacKerron <georgemackerron@neon.tech>
pkosiec
left a comment
There was a problem hiding this comment.
Thank you George for your contribution - this looks awesome!
Took a quick look (didn't test) but my general question would be: does it make sense to introduce a separate package right now? It brings more maintenance burden and I'm not sure if the benefits outweighs the increased complexity.
I'd vote for making all the exports part of the Lakebase package directly, and later on we could split it if there will be such requests from the community. What do you think?
There was a problem hiding this comment.
This is just a half of the changes required for the release - we'll need to:
- do the second part in the secure release repo (polling + actual release)
- create and configure the new package on npm side (with Fabian)
- allowlist the package in both internal npm proxies
I can take care of those items after my OOO 👍
There was a problem hiding this comment.
What are the reasons to extract this as a separate package already? Is it only about the size or do you have other concerns?
I totally get the idea of exporting such useful auth-related helpers - but wouldn't it be easier if we did it as a part of the @databricks/lakebase package from the maintenance point of view?
pkosiec
left a comment
There was a problem hiding this comment.
Code review — @databricks/lakebase-auth split
Automated multi-reviewer pass (correctness, security, adversarial, reliability, api-contract, testing, maintainability, project-standards). Load-bearing findings were verified against source. No P0/P1 code defects. 11 inline comments below (P2/P3). The DCO sign-off issue is intentionally excluded per request.
Priority order
- Eager-timer leak —
getLakebasePgConfig/getLakebaseOrmConfigdropdispose, leaking an unstoppable background refresh timer (regression). (pool-config.ts:166) - SSL
preferbreaking change — now verifies certs; document + bump@databricks/lakebaseminor. (config.ts:46) - Retry classification + min-refresh floor — together they prevent credential-API storms on misconfig/clock-skew. (retry.ts, caching.ts)
- Release pin/ordering + knip (below).
- Span leak, over-wide exports, lazy-mode token discard, and the P3s as follow-ups.
CI / template (looks good)
pack:prereleasepacks all four@databricks/*tarballs (lakebase-auth hastarball:prerelease), andprepare-template-artifact.tscopies the lakebase-auth tarball + sets itsoverridesentry — so the PR template artifact resolves the new package.prepare-release-lakebase-auth.ymlcorrectly mirrors the lakebase two-stage workflow (build → dist → SBOM → pack → SHA256 → upload; no publish/tag/push).- New package meets the CLAUDE.md "Creating New Packages" bar (
build:package/build:watch, extends root tsconfig); typecheck wiring already fixed on this branch.
One CI gap with no diff line to anchor (#9)
knip.json does not exempt packages/lakebase-auth, while it does exempt packages/lakebase (ignoreWorkspaces). So pnpm run knip (CI lint job) analyzes the new package, and combined with the over-wide exports (see the index.ts comment) it may fail. Run pnpm knip locally; either add lakebase-auth to ignoreWorkspaces for consistency, or trim the exports.
Testing gaps (suggestions, not gates)
queryWithTelemetrywrapper (pool.ts:90-148) is untested — onlypool.query.nameis asserted; the promise/callback branches, error-span recording,queryDuration, anddb.rows_affectedare never exercised.- Eager-mode in-flight dedup is untested (tests flush the initial fetch before
getToken()). - No test that eager-mode
getLakebasePgConfig/getLakebaseOrmConfigavoid leaking a timer; no test thatwithRetriesdoes not retry a non-transient error; no negative assertion that tokens never appear in logs/errors; no near/at-expiry orNaN-expiry scheduling test.
Residual risks (non-blocking)
sslMode: "disable"→ no TLS, token sent in cleartext (opt-in, defaultverify-full); worth a doc warning.- No wall-clock timeout on the credential fetch (SDK socket ~5s is the only bound; worst-case retry time can exceed
pg.PoolconnectionTimeoutMillis). getUsernameWithApiLookupswallows all errors → silentundefinedusername.getLakebaseOrmConfigdropsserverNamewhen narrowingssl→ Bun.SQL via the ORM path loses SNI.
| idleTimeoutMillis: poolConfig.idleTimeoutMillis, | ||
| connectionTimeoutMillis: poolConfig.connectionTimeoutMillis, | ||
| }; | ||
| return buildLakebasePgConfig(config, telemetry, logger).poolConfig; |
There was a problem hiding this comment.
P2 (correctness, reliability) · confidence 100 — getLakebasePgConfig returns buildLakebasePgConfig(...).poolConfig and discards dispose; getLakebaseOrmConfig builds on it and exposes no dispose either. With the new mode: "eager" default, createPasswordProvider starts an unref'd background timer that re-fetches an OAuth credential roughly hourly forever, per call, with no way to stop it — a regression from the old lazy (timer-free) behaviour for the documented TypeORM/Sequelize path. createLakebasePool is unaffected (it wraps pool.end → dispose).
Fix: default these two helpers to refresh: "lazy" (matches old behaviour), or return { ...config, dispose } and document calling it on shutdown.
| @@ -44,36 +163,7 @@ export function getLakebasePgConfig( | |||
| telemetry?: DriverTelemetry, | |||
There was a problem hiding this comment.
P3 (maintainability) · confidence 100 — telemetry?/logger? are vestigial: every caller passes only config, yet they're part of the exported getLakebasePgConfig signature. Likewise DriverTelemetry and TokenRefreshDeps (the latter tied to the deprecated createTokenRefreshCallback) are re-exported from packages/lakebase/src/index.ts with no importer outside the package.
Fix: drop the dead params and trim those two exports.
| new Date(expiresAt).toISOString(), | ||
| ); | ||
| span.setStatus({ code: SpanStatusCode.OK }); | ||
| span.end(); |
There was a problem hiding this comment.
P2 (reliability) · confidence 100 — span.end() only runs on the happy path. When generateDatabaseCredential throws, the catch at L79 rethrows without ending the span (raw OTel startActiveSpan does not auto-end on throw). Every failed refresh leaks an unclosed span.
Fix: try { … span.setStatus({code: OK}) } catch (e) { span.recordException(e); span.setStatus({code: ERROR}); throw e } finally { span.end() }.
| for (const retryDelay of schedule) { | ||
| try { | ||
| return await fn(); | ||
| } catch (err) { |
There was a problem hiding this comment.
P2 (adversarial, reliability) · confidence 100 — withRetries retries on every thrown error. A non-transient failure (403/permission, ConfigurationError, ValidationError) is still retried across the full [50, 500, 5000]ms schedule (~5.5s) before surfacing — and in eager mode this compounds with the 30s background reschedule, so a misconfigured endpoint fires 4 doomed fetches every 30s for the process lifetime. The docs promise transient retries.
Fix: add a shouldRetry?: (err) => boolean predicate (default-true for back-compat) and fast-fail 4xx/config/validation errors.
| case "verify-full": // since JS drivers check root certs, there's an implied sslrootcert=system | ||
| case "verify-ca": // upgraded to equivalent of verify-full, sslrootcert=system | ||
| case "require": // upgraded to equivalent of verify-full, sslrootcert=system | ||
| case "prefer": { |
There was a problem hiding this comment.
P2 (api-contract; security: improvement) · confidence 100 — prefer (and require) now fall through to { rejectUnauthorized: true }, and the default changed require → verify-full. Old @databricks/lakebase mapped prefer → { rejectUnauthorized: false }. This is a security improvement, but a silent breaking behaviour change: a deployment relying on prefer/PGSSLMODE=prefer against an unverifiable cert will now fail to connect.
Fix: add a BREAKING CHANGE changelog entry and bump @databricks/lakebase minor (0.3.0 → 0.4.0); a fix:/chore: commit only yields a patch.
| BACKGROUND_RETRY_MS, | ||
| err instanceof Error ? err.message : String(err), | ||
| ); | ||
| scheduleNext(BACKGROUND_RETRY_MS); |
There was a problem hiding this comment.
P3 (reliability) · confidence 75 — scheduleNext(BACKGROUND_RETRY_MS) reschedules a failed background refresh exactly 30s later with no jitter. Multiple pools/processes started together that hit the same outage all retry at the same instants (T+30s, T+60s…) — a synchronized thundering herd.
Fix: add jitter, e.g. BACKGROUND_RETRY_MS + Math.random() * BACKGROUND_RETRY_MS * 0.5.
|
|
||
| return { | ||
| getToken() { | ||
| if (Date.now() > refreshAfter) cachedToken = refresh(); |
There was a problem hiding this comment.
P2 (adversarial) · confidence 80 — getToken() does cachedToken = refresh(), overwriting the cache with a possibly-rejecting promise while the previous token is still valid for up to earlyRefreshMs (default 2 min). A transient outage in that pre-expiry window makes lazy mode fail 100% of new connections even though a usable token was in hand (eager mode doesn't have this).
Fix: keep the last successfully-resolved token+expiry separately and serve it until actual expiry; only surface the rejection once no valid token remains.
| getUsernameSync, | ||
| getUsernameWithApiLookup, | ||
| getWorkspaceClient, | ||
| mapSslConfig, |
There was a problem hiding this comment.
P2 (maintainability, api-contract) · confidence 100 — parseConfig, ParsedAuthConfig, mapSslConfig, getUsernameSync, withRetries, DEFAULT_RETRY_SCHEDULE are exported from the package entry but imported by no consumer outside this package. Once published they become committed public API (breaking to remove later).
Fix: trim the surface to getPgConfig, createPasswordProvider, generateDatabaseCredential, getUsernameWithApiLookup, errors and types; keep the rest as internal module exports. (Also addresses the knip note in the review summary.)
| `../packages/${pkgDirName}/package.json`, | ||
| ); | ||
| const depPkg = JSON.parse(fs.readFileSync(depPkgPath, "utf-8")); | ||
| pkg.dependencies[depName] = `${depPkg.version}`; |
There was a problem hiding this comment.
P2/P3 (api-contract, project-standards) · confidence 75/70 — two issues on this version pin:
- Exact pin + release ordering: lakebase's
workspace:*dep on@databricks/lakebase-authis rewritten to the exact version. The twoprepare-release-*workflows are independent and path-triggered, so nothing guaranteeslakebase-auth@0.1.0publishes before the lakebase release that requires it (first publish breaks otherwise); the exact pin also blocks lakebase-auth patch fixes from reaching lakebase consumers without re-releasing lakebase. Fix: emit`^${depPkg.version}`and coordinate lakebase-auth-first on the initial publish. - Prerelease mismatch: in a prerelease build the lakebase-auth tarball is
0.1.0-pr.<sha>but this still pins plain0.1.0. It only resolves in the template becauseprepare-template-artifact.tsadds anoverridesentry (matched by name); installing the lakebase prerelease tarball directly would try to resolve0.1.0from the registry → stale/missing.
| const provider = cachedWithTimedRefresh(fetch, EARLY, onLog); | ||
| await vi.advanceTimersByTimeAsync(0); // initial attempt fails | ||
| expect(fetch).toHaveBeenCalledTimes(1); | ||
| expect(onLog).toHaveBeenCalled(); |
There was a problem hiding this comment.
P3 (testing) · confidence 95 — expect(onLog).toHaveBeenCalled() only proves the function ran; it won't catch a wrong level, format string, or args. (Same pattern at pool-config.test.ts:193 with logger.error.) Compare retry.test.ts, which asserts all four args.
Fix: expect(onLog).toHaveBeenCalledWith('warn', expect.stringContaining('retrying'), 30000, expect.stringContaining('boom')).
Extract OAuth credential/token-refresh logic from @databricks/lakebase.
Move these to a new, lightweight @databricks/lakebase-auth package.
Add eager (default) and lazy refresh, plus retry on transient failures.